2.3 TensorFlow Single-Machine Multi-GPU Parallelism


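Many servers today come with several GPU cards in a single machine. By default, TensorFlow claims the memory of every visible GPU but runs computation only on the first one, so the cards are poorly utilized.

1. Pin the job to specific GPUs without occupying the memory of the others

This is the one-task-per-card mode: several independent tasks run on the same machine, each on its own GPU.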

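    import os

    # Set this before TensorFlow initializes the GPUs.
    # Comma-separated ids of the GPUs this process may use, numbered from 0;
    # use a single id such as "0" to pin the task to one card.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"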

This lets you run the same program with different parameters on different cards, which is handy for hyperparameter tuning.
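The tuning workflow above can be sketched as a small launcher that starts one process per parameter set and gives each process its own `CUDA_VISIBLE_DEVICES`. This is only an illustration: the script name `train.py`, the `--lr` flag, and the helper names are hypothetical and do not come from the original post.

```python
import os
import subprocess
import sys


def build_env(gpu_id, base_env=None):
    """Return a copy of the environment that exposes only one GPU."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def assign_jobs(param_sets, gpu_ids):
    """Pair each parameter set with a GPU id, cycling through the GPUs."""
    return [(params, gpu_ids[i % len(gpu_ids)])
            for i, params in enumerate(param_sets)]


def launch(script, params, gpu_id):
    """Start one training process pinned to one GPU.

    The script path and the --lr flag are hypothetical placeholders.
    """
    cmd = [sys.executable, script, "--lr", str(params["lr"])]
    return subprocess.Popen(cmd, env=build_env(gpu_id))
```

A typical use would be `for params, gpu in assign_jobs(param_sets, [0, 1]): launch("train.py", params, gpu)`. Note that with more parameter sets than GPUs, the round-robin pairing puts several jobs on the same card, so in practice you would wait for a card to become free before starting the next job on it.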

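2. Multi-GPU parallelism

Sometimes you want to use all GPUs for a single model, to cut training time and see results sooner. That calls for GPU parallelism.

GPU parallelism comes in two flavors, model parallelism and data parallelism, and each can run synchronously or asynchronously. A single machine with multiple cards usually uses synchronous data parallelism: the GPUs share the variables, each GPU computes the loss and gradients for its own data, and the gradients are averaged on the CPU and then applied to the trained parameters.

The GPU-parallel strategy in TensorFlow (shown in the figure below, the one reproduced all over the web) works as follows: every GPU holds a copy of the model, but all trainable variables are shared. Each GPU computes the loss and its gradients on a different batch of data; the CPU collects the gradients from all cards, averages them, and applies the result to the variables.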

[Figure: TensorFlow multi-GPU synchronous data-parallel training scheme]
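To make the averaging step concrete, here is a toy sketch in plain Python rather than TensorFlow. It assumes a one-parameter model with a mean-squared-error loss and equal-sized shards (all names are illustrative), and it checks that averaging the per-shard gradients gives the same result as computing the gradient on the whole batch at once.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def split_shards(items, num_shards):
    """Split a list into num_shards equal, contiguous shards."""
    size = len(items) // num_shards
    return [items[i * size:(i + 1) * size] for i in range(num_shards)]


def data_parallel_grad(w, xs, ys, num_gpus):
    """Compute one gradient per shard, then average them (the CPU's job)."""
    x_shards = split_shards(xs, num_gpus)
    y_shards = split_shards(ys, num_gpus)
    tower_grads = [grad_mse(w, x_i, y_i) for x_i, y_i in zip(x_shards, y_shards)]
    return sum(tower_grads) / num_gpus


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
w = 0.5

full = grad_mse(w, xs, ys)                             # whole batch on one device
averaged = data_parallel_grad(w, xs, ys, num_gpus=4)   # 4 shards of 2 samples
print(full, averaged)
```

The two values agree only because every shard has the same size and the loss is a mean over samples; with unequal shards, a plain average of the shard gradients would weight the samples unevenly.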

The corresponding TensorFlow code:

    import tensorflow as tf  # TensorFlow 1.x API

    # graph, config, inference, trian_next_batch and the hyperparameters
    # (NUM_GPUS, batch_size, lr0, lr_step, lr_decay, max_steps, SPACE_*_DIMS)
    # are defined elsewhere in the original script and are not shown here.


    # Gradient-averaging function; the same version appears in nearly every tutorial.
    def average_gradients(tower_grads):
        """Calculate the average gradient for each shared variable across all towers.

        Note that this function provides a synchronization point across all towers.

        Args:
          tower_grads: List of lists of (gradient, variable) tuples. The outer
            list is over individual gradients. The inner list is over the
            gradient calculation for each tower.
        Returns:
          List of pairs of (gradient, variable) where the gradient has been
          averaged across all towers.
        """
        average_grads = []
        for grad_and_vars in zip(*tower_grads):
            # grad_and_vars holds one (gradient, variable) pair per GPU,
            # all for the same shared variable.
            grads = []
            for g, _ in grad_and_vars:
                # Add a leading "tower" dimension to average over.
                expanded_g = tf.expand_dims(g, 0)
                grads.append(expanded_g)
            grad = tf.concat(grads, 0)
            grad = tf.reduce_mean(grad, 0)
            # The variable is shared, so take it from the first tower.
            v = grad_and_vars[0][1]
            average_grads.append((grad, v))
        return average_grads


    def train_multi_gpu():
        global graph
        # Set up the input placeholders on the CPU.
        with graph.as_default(), tf.device('/cpu:0'):
            # input
            x = tf.placeholder(
                tf.float32,
                shape=[None, SPACE_I_DIMS, SPACE_J_DIMS, SPACE_K_DIMS, 1],
                name='x')
            y_ = tf.placeholder(tf.int64, shape=[None, 1])
            drop_rate = tf.placeholder(tf.float32)

            # learning rate
            global_step = tf.train.get_or_create_global_step()
            lr = tf.train.exponential_decay(
                lr0,
                global_step,
                decay_steps=lr_step,
                decay_rate=lr_decay,
                staircase=True)

            # optimizer
            opt = tf.train.AdamOptimizer(learning_rate=lr)

            # Collect the gradients of every GPU in a list.
            tower_grad = []
            with tf.variable_scope(tf.get_variable_scope()):
                for i in range(NUM_GPUS):
                    with tf.device('/gpu:%d' % i):
                        with tf.name_scope('gpu_%d' % i) as scope:
                            with tf.name_scope("tower_%d" % i):
                                # Give each GPU a different slice of the fed batch.
                                _x = x[i * batch_size:(i + 1) * batch_size]
                                _y = y_[i * batch_size:(i + 1) * batch_size]

                                # calculate inference
                                y = inference(_x, reuse=False, drop_rate=drop_rate)

                                # loss
                                mse_loss = tf.losses.mean_squared_error(_y, y)
                                cur_loss = mse_loss

                                # Gradients of this tower.
                                cur_grad = opt.compute_gradients(cur_loss)
                                tower_grad.append(cur_grad)

                                # Share variables with the towers that follow.
                                tf.get_variable_scope().reuse_variables()

            # Average the gradients across GPUs.
            grads = average_gradients(tower_grad)

            # Update the parameters.
            apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

            # Initialize the parameters.
            init = tf.global_variables_initializer()

            # train steps; the with-block installs the session as default
            # and closes it on exit.
            with tf.Session(config=config) as sess:
                # init all variables
                init.run()
                try:
                    for i in range(max_steps):
                        # get next step data
                        x_batch, y_batch, id, sex = sess.run(trian_next_batch)

                        # training step
                        _ = sess.run(
                            [apply_gradient_op],
                            feed_dict={
                                x: x_batch,
                                y_: y_batch,
                                drop_rate: 0.2,
                            })
                except tf.errors.OutOfRangeError:
                    # The original except clause is cut off in the source.
                    # Assumed here: stop once the input pipeline is exhausted.
                    pass

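Pay attention to the batch size: batchsize = batchsize_single_gpu * gpu_nums. For example, with 32 samples per GPU and 4 GPUs, the total batch size is 32 * 4 = 128. Each batch fed to x and y_ must contain that many samples, because every tower slices out its own batch_size rows. The code also shows quite clearly how multi-GPU parallelism works in TensorFlow.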
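The slicing rule can be checked with a few lines of plain Python. The helper names below are illustrative and not part of the original code; the index arithmetic mirrors `x[i * batch_size:(i + 1) * batch_size]` in the training function.

```python
def tower_slices(per_gpu_batch, num_gpus):
    """Return the (start, end) row range that each GPU takes from the fed batch."""
    return [(i * per_gpu_batch, (i + 1) * per_gpu_batch)
            for i in range(num_gpus)]


def check_feed_size(num_samples, per_gpu_batch, num_gpus):
    """Raise if the fed batch does not hold exactly one sub-batch per GPU."""
    expected = per_gpu_batch * num_gpus
    if num_samples != expected:
        raise ValueError(
            "expected %d samples (%d per GPU x %d GPUs), got %d"
            % (expected, per_gpu_batch, num_gpus, num_samples))
    return expected


print(tower_slices(32, 4))
print(check_feed_size(128, 32, 4))
```

If fewer rows are fed than expected, the later slices silently come back short or empty instead of failing, which is why an explicit check like this can be worth adding before `sess.run`.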

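3. Notes

Multi-GPU training is faster, but the speedup is not linear, because communication between GPUs takes time. For example, if a single GPU needs 50 seconds for 100 steps and processes 3,200 samples in that time, 4 GPUs in parallel might need 150 seconds for 100 steps but process 3200 * 4 = 12,800 samples. Do not use too many GPUs: because bus bandwidth is limited, different GPUs see different latencies, which stretches the time per step. Multi-GPU parallelism also puts a heavy load on the CPU, so it demands more of the server as a whole. If the server is not very strong overall, stick with one task per card.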
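The example figures can be turned into throughput and scaling numbers with a short calculation. The timings are the illustrative ones from the paragraph above, not measurements, and the function name is made up for this sketch.

```python
def throughput(samples, seconds):
    """Samples processed per second."""
    return samples / seconds


num_gpus = 4
single = throughput(3200, 50)               # one GPU: 100 steps in 50 s
multi = throughput(3200 * num_gpus, 150)    # four GPUs: 100 steps in 150 s

speedup = multi / single          # how much faster than one GPU
efficiency = speedup / num_gpus   # fraction of ideal linear scaling

print("single GPU: %.1f samples/s" % single)
print("4 GPUs:     %.1f samples/s" % multi)
print("speedup:    %.2fx, efficiency: %.0f%%" % (speedup, efficiency * 100))
```

With these numbers the four-GPU run is about a third faster than a single GPU, which is the kind of case where running four independent single-GPU jobs gives more total work per second.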
